Model Report for Baseball Seasons
Generated on 15 Apr 2025, 00:03 ● 19,689 original samples, 19,494 synthetic samples
Accuracy: 89.8% (98.0%)
Plot sections: Correlations; Univariate Distributions; Bivariate Distributions; Coherence: Auto-correlations; Coherence: Sequences per Distinct Category; Coherence: Distinct Categories per Sequence
Accuracy
| Column | Univariate | Bivariate | Coherence |
|---|---|---|---|
| Sequence Length | 94.8% | 87.9% | - |
| team | 93.8% | 88.2% | 94.1% |
| year | 92.9% | 86.3% | 93.7% |
| G | 90.5% | 86.7% | 91.4% |
| AB | 89.8% | 84.7% | 90.9% |
| HR | 89.0% | 87.3% | 88.0% |
| H | 88.9% | 85.2% | 90.5% |
| Total | 91.4% | 86.6% | 91.4% |
Explainer
Accuracy of synthetic data is assessed by comparing the distributions of the synthetic (shown in green) and the original data (shown in gray).
For each distribution plot, we sum the deviations across all categories to obtain the total variation distance (TVD). The accuracy is then reported as 100% - TVD.
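As a rough illustration of the accuracy measure described above, the following sketch computes 100% - TVD for a single categorical column. It assumes the standard definition of TVD (half the sum of absolute differences in category shares); whether the report applies exactly this normalization, and how it bins numeric columns, is not stated here. The function names and the toy team labels are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Return {category: share} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def tvd_accuracy(original, synthetic):
    """Return 1 - TVD between two categorical samples.

    Assumption: TVD is half the sum of absolute differences in
    category shares, so both TVD and the accuracy lie in [0, 1].
    """
    p = relative_frequencies(original)
    q = relative_frequencies(synthetic)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    return 1.0 - tvd


# Hypothetical toy data, not taken from the report.
original = ["NYY"] * 50 + ["BOS"] * 30 + ["LAD"] * 20
synthetic = ["NYY"] * 45 + ["BOS"] * 35 + ["LAD"] * 20

accuracy = tvd_accuracy(original, synthetic)
print(f"{accuracy:.1%}")
```

Numeric columns would first need to be discretized into bins before the same comparison applies.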
These accuracies are calculated for all univariate and bivariate distributions. A final accuracy score is then calculated as the average across all of these.
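The rounded figures in the Accuracy table are consistent with simple averaging: each value in the Total row matches the mean of its column, and the overall 89.8% matches the mean of the three totals. The sketch below reproduces that arithmetic from the published values. This is an observation about the rounded numbers, not a confirmed description of how the report aggregates internally.

```python
from statistics import mean

# Per-column accuracies (in %) copied from the Accuracy table.
# Order: (univariate, bivariate, coherence); None marks the "-" cell.
rows = {
    "Sequence Length": (94.8, 87.9, None),
    "team": (93.8, 88.2, 94.1),
    "year": (92.9, 86.3, 93.7),
    "G": (90.5, 86.7, 91.4),
    "AB": (89.8, 84.7, 90.9),
    "HR": (89.0, 87.3, 88.0),
    "H": (88.9, 85.2, 90.5),
}


def column_mean(index):
    """Average one metric over all rows that report a value for it."""
    values = [row[index] for row in rows.values() if row[index] is not None]
    return mean(values)


univariate = column_mean(0)
bivariate = column_mean(1)
coherence = column_mean(2)

# Assumption: the overall score averages the three per-metric totals.
overall = mean([univariate, bivariate, coherence])

print(round(univariate, 1), round(bivariate, 1), round(coherence, 1))
print(round(overall, 1))
```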
Similarity
Explainer
These plots show the first 3 principal components of training samples, synthetic samples, and (if available) holdout samples within the embedding space. The black dots visualize the centroids of the respective samples.
The similarity metric is the cosine similarity between these centroids. A value close to 1 is expected and indicates that the synthetic samples are as similar to the training samples as the holdout samples are.
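The centroid comparison can be sketched as follows. The embeddings here are hypothetical 3-dimensional vectors chosen for illustration; the report's actual embedding model and dimensionality are not specified in this section, and the helper names are made up for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not taken from the report.
training = [[1.0, 0.2, 0.1], [0.8, 0.4, 0.0], [0.9, 0.3, 0.2]]
synthetic = [[0.9, 0.3, 0.2], [1.0, 0.1, 0.1], [0.7, 0.5, 0.1]]
holdout = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.1], [1.0, 0.3, 0.0]]

sim_synthetic = cosine_similarity(centroid(synthetic), centroid(training))
sim_holdout = cosine_similarity(centroid(holdout), centroid(training))

print(f"synthetic vs. training: {sim_synthetic:.4f}")
print(f"holdout vs. training:   {sim_holdout:.4f}")
```

Comparing the two values side by side gives the holdout similarity as a baseline for what "close to 1" means on the data at hand.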
Distances
| | Synthetic vs. Training Data | Synthetic vs. Holdout Data | Training vs. Holdout Data |
|---|---|---|---|
| Identical Matches | 0.1% | 0.2% | 0.3% |
| Average Distances | 0.305 | 0.305 | 0.291 |
| DCR Share | 50.3% | | |
Explainer
Synthetic data should be as close to the original training samples as it is to the original holdout samples, which serve as a reference.
This can be checked empirically by measuring the distance from each synthetic sample to its closest original sample, with the training and holdout sets sampled to be of equal size.
A green line that lies significantly to the left of the dark gray line implies that synthetic samples are closer to the training samples than to the holdout samples, indicating that the model has overfitted to the training data.
A green line that overlaps the dark gray line confirms that the trained model represents the general rules, which can be found in the training samples just as well as in the holdout samples.
The DCR share is the proportion of synthetic samples that are closer to a training sample than to a holdout sample. Ideally, this value should not significantly exceed 50%, as a higher value could indicate overfitting.
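A minimal sketch of the DCR share follows, assuming Euclidean distance on already-encoded numeric vectors and counting ties as half. Both choices are assumptions made for illustration; the distance metric and tie handling used by the report are not stated in this section, and the function names and toy points are hypothetical.

```python
import math


def nearest_distance(sample, references):
    """Distance from one sample to its closest reference sample."""
    return min(math.dist(sample, ref) for ref in references)


def dcr_share(synthetic, training, holdout):
    """Share of synthetic samples closer to training than to holdout.

    Assumption: a tie between the two distances counts as half.
    """
    score = 0.0
    for sample in synthetic:
        d_train = nearest_distance(sample, training)
        d_holdout = nearest_distance(sample, holdout)
        if d_train < d_holdout:
            score += 1.0
        elif d_train == d_holdout:
            score += 0.5
    return score / len(synthetic)


# Hypothetical 2-dimensional samples; training and holdout have equal size.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
holdout = [(0.5, 0.5), (2.0, 2.0), (2.0, 0.0), (0.0, 2.0)]
synthetic = [(0.1, 0.1), (0.9, 1.0), (1.9, 1.9), (2.0, 0.1)]

share = dcr_share(synthetic, training, holdout)
print(f"DCR share: {share:.1%}")
```

In this toy setup two of the four synthetic points sit nearer to a training point and two nearer to a holdout point, which is the balanced outcome the explainer describes as ideal.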